ABSTRACT

An empirical sampling study investigated six
procedures for testing differences between means in the presence of
unequal n's and variances. Support was obtained for previous research
which found t robust to heterogeneous variances only when n's are
equal and of moderate size. The procedure which emerged as providing
the best control over Type I errors while maintaining satisfactory
power in all test conditions was the Behrens-Fisher v statistic with
Welch's solution for degrees of freedom (df). The general
recommendations when the population variances are unknown are: (1)
when n's are the same and equal to or greater than 20 it is
permissible to use the t statistic with df = 2n - 2, but when n is
less than 20 use v with Welch's solution for df; (2) when n's are
unequal use the v statistic with the Welch adjustment for df.
(Author)
Procedures for Testing $\mu_1 = \mu_2$ with Unequal n's and Variances

Richard L. Kohr 
Pennsylvania Department of Education 

and 

Paul A. Games 
The Pennsylvania State University 

An empirical sampling study investigated six procedures for testing
differences between means in the presence of unequal n's and variances.
Support was obtained for previous research which found t robust to
heterogeneous variances only when n's are equal and of moderate size. The
procedure which emerged as providing the best control over Type I errors
while maintaining satisfactory power in all test conditions was the
Behrens-Fisher v statistic with Welch's solution for df. The general
recommendations when the population variances are unknown are: (1) when n's
are equal and $\geq$ 20 it is permissible to use the t statistic with
df = 2n - 2, but when n < 20 use v with Welch's solution for df; (2) when
n's are unequal use the v statistic with the Welch adjustment for df.

Scheffe (1970, p. 1501) begins a recent article by stating "The most
frequently occurring problem in applied statistics is, in my opinion, the
comparison of the means of two populations, ... Let $\sigma_1^2$ and
$\sigma_2^2$ denote the population variances. It is called the Behrens-Fisher
problem if the ratio $\theta = \sigma_1^2/\sigma_2^2$ is unknown and the
assumption is added that the populations are normal. The normality assumption
is of no practical importance for any of the solutions... based mainly on the
difference of the sample means are robust against its violation." Despite the
fact that $\theta$ is unknown in most empirical studies, the Behrens-Fisher
problem is ignored in many behavioral statistics books (Ferguson, 1966;
Guilford, 1965; Glass & Stanley, 1970; Dayton, 1970; Myers, 1966; Watt &
Bridges, 1967). Instead the null hypothesis $\mu_1 = \mu_2$ is usually tested
by the conventional t test with df = $n_1 + n_2 - 2$, as if the assumption
that $\theta = 1.0$ were based on something other than the experimenter's
hope. This article will attempt to point out that there are other practical
solutions to this problem which do not need this dubious assumption.

Behrens (1929) provided the first "exact" solution for this problem, which
coincides with the Bayesian solution of Jeffreys (1940) and Savage (1961).
Fisher (1935) extended Behrens' work and the statistic used became known as the
Behrens-Fisher statistic, v (Winer, 1962, p. 37). Tabled probability values
were prepared by Sukhatme (1938) and are reproduced in Fisher and Yates (1963).
Welch (1947) proposed an exact test for v, which was subsequently tabled
by Aspin (1949). Welch (1949) reports little difference between these critical
values and an approximate solution obtained by adjusting the df of t (Winer,
1962, p. 37; Kirk, 1968, p. 98) so that

$$df = \frac{df_1\,df_2}{df_2\,c^2 + df_1\,(1 - c)^2}, \quad \text{where } c = \frac{s_1^2/n_1}{s_1^2/n_1 + s_2^2/n_2}.$$

The critical value of t($\alpha$, df) is compared with v to complete the test.
We shall label this the v-W solution.
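The df adjustment above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' FORTRAN program: the function name `welch_v_and_df` is hypothetical, $df_i = n_i - 1$ is assumed, and v is taken to be the usual separate-variance statistic $(\bar{X}_1 - \bar{X}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$, which the text names but does not write out.

```python
import math


def welch_v_and_df(x1, x2):
    """Return (v, df): the separate-variance statistic and Welch's adjusted df.

    Assumes each sample has at least two values and non-zero variance.
    """
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    # Unbiased sample variances s1^2 and s2^2.
    s1_sq = sum((x - m1) ** 2 for x in x1) / (n1 - 1)
    s2_sq = sum((x - m2) ** 2 for x in x2) / (n2 - 1)
    a, b = s1_sq / n1, s2_sq / n2
    v = (m1 - m2) / math.sqrt(a + b)
    # c is the share of the squared standard error contributed by group 1.
    c = a / (a + b)
    df1, df2 = n1 - 1, n2 - 1
    df = (df1 * df2) / (df2 * c ** 2 + df1 * (1 - c) ** 2)
    return v, df


v, df = welch_v_and_df([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(v, df)  # equal n and equal sample variances: df = 2n - 2 = 8
```

With equal n's and equal sample variances, c = .5 and the adjusted df reaches its maximum of 2n - 2; as one group's share of the variance dominates, df falls toward that group's own $n_i - 1$.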

Several other methods have also been suggested which seek to approximate
the sampling distribution of v. Satterthwaite (1946, p. 114) offered a df
adjustment where df can range from $n_s - 1$ (where $n_s$ is the smaller n) to
$n_1 + n_2 - 2$. The df solutions of Welch and Satterthwaite are algebraically
equivalent. Dixon and Massey (1957, p. 124) and Hays (1963, p. 322) present
a formula which is a slight variation of the one given by Satterthwaite. A
critical t procedure on the conservative side was devised by Cochran and Cox
(1950, p. 92).

Scheffe (1943) described the following solution to the Behrens-Fisher
problem. Let group 1 and 2 represent the smaller and larger groups respectively.
The observations in each group are randomly ordered, and $n_1$ values of a new
variate, $U_i$, are computed by

$$U_i = X_{1i} - (n_1/n_2)^{1/2}\,X_{2i}, \quad \text{where } i = 1, \ldots, n_1.$$

A t-test that the mean of the U variate is 0.0 (or a corresponding confidence
interval) is then conducted using the U's as the input data. Scheffe
demonstrated that the expected width of the 95% confidence intervals produced
by the above method was no more than 11% greater than the width when the
population variances were known and n > 10. Some statisticians are unhappy with
randomization procedures such as the above, since E's may obtain different
results with the same set of initial data.

Scheffe himself suggested discarding this method because he apparently 
found that some experimenters who did not like the width of the confidence 
interval obtained merely rerandomized and computed a second interval (Scheffe, 
1970, footnote 3). However, if the data is entered into a computer, with 
automation doing the randomization, the experimenter usually will be oblivious 
of this temptation. 

Another possible method is a chain logic approach. Games and Klare (1967,
p. 49A) suggested using t if equal n's greater than 10 are present. Otherwise
$F = s_1^2/s_2^2$ is computed and tested at $\alpha = .10$. If this test is
significant, v with Welch's df is used. If not, the conventional t is used.
The probability of Type I errors, P(EI), resulting from this method may be
expected to fall between that of t and that of v with Welch's adjustment to df.

There are many other approximations to the sampling distribution of v, and
many other procedures for handling the Behrens-Fisher problem. A search of the
mathematical statistics literature obtained over 50 references of which 27 were
published in 1960 or later.

Scheffe (1970) compared six solutions: the Behrens-Fisher solution; his
1943 solution, S; the conventional t test; the use of v with
t($\alpha$, $n_s - 1$); the Welch-Aspin solution; and the v-W. His conclusions
on the conventional t test may shock naive users: "...this solution of the
Behrens-Fisher problem is asymptotically incorrect for large unequal sample
sizes, elementary calculations showing that [the Type I error probability] may
take on any value between 0 and 1 for any $\alpha$ ($0 < \alpha < 1$) and for
suitable $\theta$ and suitably large $n_1$, $n_2$. The practical conclusion is
that this solution should never be used unless $n_1$ and $n_2$ are equal or
nearly so" (1970, p. 1506). The error probability that Scheffe writes as a
function of $\theta$ is P(EI) in the present paper. When $n_1 = n_2 = n$,
the limit of P(EI) is P[|t(n - 1)| > t($\alpha$, 2n - 2)]. Thus when
$\alpha$ = .05 and n = 10, the limit of P(EI) is .065. For larger equal n's,
P(EI) will deviate less from $\alpha$.
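The limiting value quoted above can be checked numerically. The sketch below is an illustration only: it uses plain numerical integration of the t density (Simpson's rule) and bisection in place of printed t tables, and the helper names `t_pdf`, `two_tailed_p`, `t_critical`, and `limiting_type1_rate` are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(x, df, steps=2000):
    """P(|T| > |x|), integrating the density on [0, |x|] by Simpson's rule."""
    x = abs(x)
    if x == 0.0:
        return 1.0
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (h / 3.0) * total


def t_critical(alpha, df):
    """Two-tailed critical value t(alpha, df), found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def limiting_type1_rate(n, alpha=0.05):
    """P[|t(n - 1)| > t(alpha, 2n - 2)], the limit of P(EI) for equal n."""
    return two_tailed_p(t_critical(alpha, 2 * n - 2), n - 1)


for n in (10, 20, 40):
    print(n, round(limiting_type1_rate(n), 4))
```

For n = 10 the result should round to about .065, in agreement with the text, and the rate moves toward $\alpha$ as n grows.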

Scheffe (1970) concludes that only the use of v with t($\alpha$, $n_s - 1$) as
the critical value, the Welch-Aspin solution, and the v-W solution are
practical. He finds little difference between the last two. The use of v with
t($\alpha$, $n_s - 1$) is conservative and has a lower power than the v-W.
Thus he concludes "...Welch's approximate t-solution, which requires only the
ubiquitous t-tables, is a satisfactory practical solution of the
Behrens-Fisher problem" (1970, p. 1505).

Wang (1971) presents results that confirm the similarity of the Welch-Aspin
and v-W solutions, and the excellence of control of P(EI) by the v-W
test. With $n_s$ as small as 5 and $\theta$ varying from .0078 to 128, he
finds that P(EI) does not deviate from $\alpha$ by more than .0035. When $n_s$
is increased to 7, the maximum deviation drops to .0018; with $n_s$ = 11, to
.0005; and to .0002 with larger $n_s$ values. Thus the v-W test is supported
as an excellent procedure whenever the population variances are unknown, and
as an absolute necessity when the sample sizes are unequal.

Method 

An orthogonal design was used to contrast the P(EI) and power of six
methods across three dimensions. The condition of heterogeneity of variance
was represented by seven levels of variance ratios
(VR = $\sigma_1^2/\sigma_2^2$): .025, .25, .5, 1.0, 2.0, 4.0, and 40.0. The
proportionality of n's was represented by four levels of sample size ratios
(NR = $n_2/n_1$): 1.00, 1.25, 1.50, and 3.00. Sample size (SS) was
represented by three levels: $n_1$ values of 4, 12, and 24. These three
dimensions were completely crossed, making 7 x 4 x 3 = 84 conditions that were
investigated in the study. Six points on the power curve were established for
each condition. These always included the condition where
$\mu_1 - \mu_2 = 0$ (i.e., the null is true), plus five other evenly spaced
points chosen to avoid a ceiling effect for the last condition. The six
procedures used were: (1) t referred to the t distribution with
df = $n_1 + n_2 - 2$; (2 and 3) v referred to the t distribution with Welch's
(v-W) and Dixon and Massey's (v-DM) solutions for df; (4) v referred to
Cochran and Cox's critical t (v-CC); (5) the chain logic approach of Games and
Klare (G & K); (6) Scheffe's procedure (S).
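The crossed design can be enumerated directly, which makes the 84 conditions and the implied $n_2$ values explicit. This is a small sketch with hypothetical names (`design_conditions` and the three constants); the direction of the variance ratio, $\sigma_1^2/\sigma_2^2$, is inferred from the later statement that $\sigma_1^2 = 1.0$ with $\sigma_2^2 = 40$ corresponds to VR = .025.

```python
from itertools import product

VARIANCE_RATIOS = [0.025, 0.25, 0.5, 1.0, 2.0, 4.0, 40.0]  # VR = sigma1^2 / sigma2^2
N_RATIOS = [1.00, 1.25, 1.50, 3.00]                         # NR = n2 / n1
N1_VALUES = [4, 12, 24]                                     # sample size levels


def design_conditions():
    """Return one dict per cell of the 7 x 4 x 3 crossed design."""
    conditions = []
    for vr, nr, n1 in product(VARIANCE_RATIOS, N_RATIOS, N1_VALUES):
        n2 = int(round(n1 * nr))  # every n1 * NR product here is a whole number
        conditions.append({"VR": vr, "NR": nr, "n1": n1, "n2": n2})
    return conditions


conditions = design_conditions()
print(len(conditions))  # 84
print(sorted({c["n2"] for c in conditions if c["n1"] == 12}))  # [12, 15, 18, 36]
```

Each of these conditions was then run at six values of $\mu_1 - \mu_2$, as described above.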

A FORTRAN IV computer program written for the IBM 360/67 generated a
population of 9998 cases having the following parameters: $\mu$ = 0.0012;
$\sigma$ = 1.0129; skewness, $\gamma_1$ = 0.002; kurtosis, $\gamma_2$ = -0.0322.
Each computer run involved the following steps:

(1) Generation of the above population. 

(2) A sample of $n_1$ cases (representing group one) was drawn from the
population. All sampling was with replacement. The normal deviate was multiplied
by a constant representing the desired standard deviation to produce different
variances when desired. A constant (0 to 11.0) was added to the randomly
drawn value so that the population mean of group one might differ from that of
group two when desired.

(3) A sample of $n_2$ cases (representing group two) was drawn from the
population. The population mean of group two was fixed at zero throughout the
study. Since the mean for group one was the only one to vary, the true
difference, $\mu_1 - \mu_2$, was always positive in value. The variance of
group two was manipulated in the same manner as in group one.

(4) The sample statistics required for conducting all significance tests
for each of the six procedures were computed.

(5) Significance tests were conducted for all six statistical procedures
at the .02 and .05 levels and the number of significant results was tabulated
for each.

(6) Steps (2) through (5) were repeated until 250 simulated experiments
had been obtained.

(7) The proportion of significant results at each significance level for
each statistical procedure was punched.

(8) Steps (2) through (7) were repeated until four blocks of 250 samples
each were drawn.

(9) Steps (2) through (8) were repeated for each value of $\mu_1 - \mu_2$.

(10) Steps (2) through (9) were repeated if $n_1 \neq n_2$ and
$\sigma_1^2 \neq \sigma_2^2$, switching the desired variance from group two
to group one. This permitted the larger variance to be combined with both the
larger and the smaller n.

All calculations were performed in double precision to insure the greatest
possible accuracy. The pseudo-random number generator used in the study was
prepared by Knoble (1969). Single precision, floating point values are returned
which are approximately serially independent and uniformly distributed on the
unit interval. The cycle length is $2^{31} - 2$. Different sequences of
pseudo-random numbers are obtained by entering the sequence at a different
point. Statistical properties of the generator may be found in Payne, Rabung,
and Bogyo (1969) and Lewis, Goodman, and Miller (1969).
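The sampling steps can be sketched for two of the six procedures (the conventional t and v-W) in one condition. This is a simplified illustration under stated assumptions, not the original program: normal deviates come from Python's `random.gauss` rather than from the stored population of 9998 cases, the tests are taken to be two-tailed, p-values are obtained by numerical integration of the t density, and all function names are hypothetical.

```python
import math
import random


def t_two_tailed_p(x, df, steps=400):
    """P(|T| > |x|) for Student's t, by Simpson's rule (steps must be even)."""
    x = abs(x)
    if x == 0.0:
        return 1.0
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * (h / 3.0) * total)


def mean_and_variance(values):
    n = len(values)
    m = sum(values) / n
    return m, sum((x - m) ** 2 for x in values) / (n - 1)


def run_block(n1, n2, sd1, sd2, shift, alpha, reps, rng):
    """Steps (2)-(6): rejection proportions for t and v-W over reps experiments."""
    reject_t = reject_vw = 0
    for _ in range(reps):
        # Steps (2)-(3): scale unit normal deviates; shift only group one.
        g1 = [rng.gauss(0.0, 1.0) * sd1 + shift for _ in range(n1)]
        g2 = [rng.gauss(0.0, 1.0) * sd2 for _ in range(n2)]
        # Step (4): sample statistics.
        m1, v1 = mean_and_variance(g1)
        m2, v2 = mean_and_variance(g2)
        diff = m1 - m2
        # Step (5a): conventional pooled-variance t with df = n1 + n2 - 2.
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        t_stat = diff / math.sqrt(pooled * (1 / n1 + 1 / n2))
        reject_t += t_two_tailed_p(t_stat, n1 + n2 - 2) < alpha
        # Step (5b): v referred to t with Welch's adjusted df.
        a, b = v1 / n1, v2 / n2
        c = a / (a + b)
        df = (n1 - 1) * (n2 - 1) / ((n2 - 1) * c ** 2 + (n1 - 1) * (1 - c) ** 2)
        reject_vw += t_two_tailed_p(diff / math.sqrt(a + b), df) < alpha
    return reject_t / reps, reject_vw / reps


# Step (8): four blocks of 250 under a true null; the larger variance
# (40 vs. 1) is paired with the smaller sample (12 vs. 36).
rng = random.Random(1970)
blocks = [run_block(12, 36, math.sqrt(40.0), 1.0, 0.0, 0.05, 250, rng)
          for _ in range(4)]
rate_t = sum(b[0] for b in blocks) / len(blocks)
rate_vw = sum(b[1] for b in blocks) / len(blocks)
print("t:", rate_t, "v-W:", rate_vw)
```

Setting `shift` to a positive constant turns the same loop into a power estimate for that value of $\mu_1 - \mu_2$.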
RESULTS



Case Where the Null Hypothesis is True 

For brevity, tabular results are presented for only four of the techniques. 
The critical values of Cochran and Cox are greater than or equal to those of 
the Welch approximation. The Welch critical values are greater than or equal 
to those of Dixon and Massey. Since methods 2, 3, and 4 consisted in referring
v to these critical values, it is clear that P(EI) and power will vary
systematically between them. The study confirmed that the v-W solution P(EI)'s
are closest to alpha while the v-CC method is systematically conservative, and
the v-DM method is systematically permissive.

Table 1 summarizes P(EI) of the v-W and the remaining competitive solutions
for the conditions producing the greatest ($n_1$ = 4) and least ($n_1$ = 24)
discrepancies. Since there are 1000 observations, any proportion less than
.0365 or greater than .0635 is significantly different from alpha at the .05
level.
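The two cutoffs are consistent with a normal approximation to the binomial with p = .05 and 1000 experiments per cell. That derivation is an inference that happens to reproduce the quoted figures, and the function name below is hypothetical.

```python
import math


def binomial_bounds(p0, n_obs, z=1.96):
    """Two-sided bounds p0 +/- z * sqrt(p0 * (1 - p0) / n_obs)."""
    se = math.sqrt(p0 * (1 - p0) / n_obs)
    return p0 - z * se, p0 + z * se


lower, upper = binomial_bounds(0.05, 1000)
print(round(lower, 4), round(upper, 4))  # 0.0365 0.0635
```

Observed proportions outside this band are the entries flagged with an asterisk in Table 1.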

In general, the t statistic revealed the expected distortion to P(EI) when
both variances and n's are unequal. Conservative results occur when the
larger variance is combined with the larger sample, while an excessive number
of rejections of $H_0$ occurs when the larger variance is paired with the
smaller sample. As the n-ratio increases this effect also increases. It is
clear that the t test is not robust to violation of the homogeneity of variance
assumption when n's are unequal. Note that increasing the sample size does not
eliminate the distortion to P(EI) when t is used with unequal n's. The Welch,
Games and Klare, and Scheffe techniques all reduced the t's fluctuations, the
improvement in control over P(EI) generally increasing with increased sample
sizes. Only slight deviations from the theoretical .05 value occurred for
these three methods.
TABLE 1

PROPORTION OF SIGNIFICANT RESULTS WHEN $H_0$ IS TRUE, $\alpha$ = .05

                      Small Sample Size               Large Sample Size
N-R    V-R        t      v-W    G&K    S          t      v-W    G&K    S

1.0    0.025    .092*  .065*  .068*  .057       .060   .058   .060   .056
1.0    0.25 to 4.0     (entries illegible in the source copy)
1.0    40.0     .088*  .060   .062   .054       .053   .052   .053   .052

1.25   0.025 to 40.0   (entries illegible in the source copy)

1.5    0.025    .031*  .048   .046   .061       .015*  .046   .046   .046
1.5    0.25     .033*  .039   .038   .045       .036*  .051   .050   .053
1.5    0.5      .034*  .040   .039   .050       .029*  .047   .043   .046
1.5    1.0      .043   .039   .043   .051       .047   .048   .048   .049
1.5    2.0      .062   .048   .053   .045       .073*  .054   .058   .052
1.5    4.0      .094*  .063   .078*  .048       .085*  .048   .048   .049
1.5    40.0     .156*  .060   .062   .052       .115*  .050   .050   .047

3.0    0.025    .005*  .053   .053   .049       .000*  .041   .041   .048
3.0    0.25     .005*  .043   .033   .056       .009*  .042   .041   .057
3.0    0.5      .028*  .060   .056   .056       .018*  .047   .043   .063
3.0    1.0      .051   .061   .067*  .065*      .055   .058   .053   .063
3.0    2.0      .104*  .053   .091*  .044       .096*  .051   .059   .057
3.0    4.0      .145*  .059   .102*  .045       .144*  .046   .046   .048
3.0    40.0     .301*  .057   .063   .058       .221*  .046   .046   .057

* Represents significant deviation from $\alpha$



Special Case of Equal Sample Sizes 

Equal sample sizes represent a special case in which prior research has
revealed general robustness of t in the face of heterogeneous variances.
Sample size, variance ratio, and the procedures of Table 1 were factors in a
3 factor ANOVA of P(EI).

The major significant source of variance was the procedures main effect,
which accounted for an estimated 18 per cent of the variance. The mean P(EI)
for the t, Welch, Games and Klare, and Scheffe procedures were .0576, .0511,
.0540, and .0508 respectively. For the intermediate and large equal n cases all
four procedures exercise adequate control over Type I errors. The Welch and
Scheffe methods were the most stable across situations.

Case of Varying Degrees of Deviation from Ho (Power) 

For a more complete comparison of the four procedures a series of power 
curves were plotted. The power curves that are graphed and discussed in the 
following sections are those of the intermediate n case. The essential 
difference between the curves of the intermediate n case and those of the 
small n case is that the curves for the Welch, Scheffe, and Games and Klare 
procedures are further apart when n's are small and become progressively 
closer as sample size increases. 

Special Case of Equal Variances 
When $\sigma_1^2 = \sigma_2^2$ and the populations are normally distributed,
as in this study, then the t test is the most powerful possible test. When the
variance ratio was 1.0 and the n-ratio was 1.25 ($n_1$ = 12, $n_2$ = 15) the
power curve for t was indistinguishable from those for Welch's solution and
the Games and Klare procedure. As the n-ratio increased to 3.0 ($n_1$ = 12,
$n_2$ = 36) some separation of the power curves occurred. In this region the t
demonstrated the highest power with the Welch solution close behind, followed
by the Scheffe procedure. The power loss by using the Welch technique did not
exceed .05.

Unequal Variances with Unequal n's 

When the larger variance occurs for the sample having the larger number
of observations the t becomes conservative with respect to control over P(EI).
This occurs even for small variance differences. The most extreme combination
of unequal n's and variances was represented in the case where $n_1$ = 12 and
$n_2$ = 36 with $\sigma_1^2$ = 1.0 and $\sigma_2^2$ = 40 (VR = .025). Figure 1
presents the power curves for this situation (dashed lines). The loss of power
for t is very noticeable. A less pronounced effect, with VR = .5 (solid
lines), was found.

The situation where the large variance occurs for the sample having the
fewer cases affects t by inflating P(EI). This effect was apparent with
n-ratios as small as 1.25. A case in point is that in which $n_1$ = 12,
$n_2$ = 15, as shown in Figure 2. When the variances were $\sigma_1^2$ = 40.0
and $\sigma_2^2$ = 1.0, the Welch, Scheffe, and Games and Klare procedures
were nearly indistinguishable in terms of power. Each offers excellent control
over P(EI), while the P(EI) of t is highly inflated. When the n-ratio is
increased to 3.0 ($n_1$ = 12, $n_2$ = 36) the effect on t is extreme, with a
highly inflated risk of a Type I error and a spuriously high power, as shown in
Figure 2. This effect is not mitigated when the n's are increased to 24 and 72.

When n's diverge to the extent of 3:1, variance ratios as small as 2.0
($\sigma_1^2$ = 2.0, $\sigma_2^2$ = 1.0) have a readily discernible effect
upon t. Characteristically, the t revealed an inflated risk of a Type I error
with correspondingly high power. Adequate control over P(EI) was attained by
the other test procedures. The Welch solution had a slight edge in power over
that of Scheffe. The Games and Klare method was superior to both the Welch and
Scheffe methods in terms of power, but this was at the expense of an inflated
P(EI) of .066.
Figure 1. $\alpha$ = .05 power curves when large sample is drawn from the
population with the large variance: Comparison of curves for slight vs. large
variance inequality when $n_1$ = 12, $n_2$ = 36. Parameters:
$\sigma_1^2$ = 1.0, $\sigma_2^2$ = 2.0 (solid lines) and $\sigma_1^2$ = 1.0,
$\sigma_2^2$ = 40.0 (dashed lines).
Figure 2. $\alpha$ = .05 power curves when large sample is drawn from the
population with the small variance: Comparison of curves for slight vs. large
inequality of n's when $\sigma_1^2$ = 40, $\sigma_2^2$ = 1.0. Parameters:
$n_1$ = 12, $n_2$ = 15 (dashed lines) and $n_1$ = 12, $n_2$ = 36 (solid lines).
DISCUSSION

If the population variances are known, the classical normal curve test
(Blommers & Lindquist, 1960, p. 256) is the most powerful possible test and
should be used. In practice, whenever t is considered, the population variances
are unknown. Most of the time, there is little if any a priori basis to
defend the assumption that the two unknown population variances are equal.
The notion of additive treatment effects is a simplification that does not
necessarily exist in reality. Treatment effects may be multiplicative with
some individual difference parameter, or produce unequal variances because of
some other interaction with subjects.

The present study demonstrates that the Games and Klare technique and the
Scheffe test are superior to the conventional t test. However, the Scheffe
test has the liability that it could produce different results when different
randomizations are used. The Games and Klare procedure has the disadvantage
of a multi-stage test. Nowhere in our study did either of these tests show any
advantages over the v-W technique. Wherever differences were noted, they were
in the direction of the superiority of the v-W test over all of its competitors.

The present study (Kohr, 1970) was conducted prior to the appearance of
the two major theoretical papers supporting the use of the v-W technique by
Scheffe (1970) and Wang (1971). As such, we find ourselves in the embarrassing
position of empirically demonstrating the superiority of a test that can now
be shown to be superior by theoretical analyses.

The present state of knowledge suggests that the conventional t test be
used only when $n_1 = n_2 = n$ and n is moderate to large. The t is tolerated
in this case only because its permissive bias (when
$\sigma_1^2 \neq \sigma_2^2$) is very mild. Even in this case, the user would
be better off using the v-W solution. However, when $n_1 = n_2$, then v = t,
and the Welch critical value can vary only between t($\alpha$, n - 1) and
t($\alpha$, 2n - 2). Using $\alpha$ = .05, when n = 21, the critical value
could vary only between 2.086 and 2.021. In this situation, the t test may be
tolerated as a good approximation to the more accurate v-W solution, since the
area between these two points is small. For small n's, the range of critical
values possible under the Welch solution is larger, and the complete v-W
solution should be used. When n's are unequal, then v $\neq$ t, and the v-W
solution should always be used. The Wang (1971) study considered a minimum
sample size of 5, and the present study used a minimum size of 4. It is
possible that with samples of only 2 or 3 cases, the v-W solution may not
adequately control P(EI), but it is hard to conceive of any behavioral study
that should be conducted with samples of this size.